{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "\n",
    "Word2Vec is a popular algorithm used for generating dense vector representations of words in large corpora using unsupervised learning. These representations are useful for many natural language processing (NLP) tasks like sentiment analysis, named entity recognition and machine translation.  \n",
    "\n",
    "Popular models that learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. *SageMaker BlazingText* can learn vector representations associated with character n-grams; representing words as the sum of these character n-grams representations [1]. This method enables *BlazingText* to generate vectors for out-of-vocabulary (OOV) words, as demonstrated in this notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Popular tools like [FastText](https://github.com/facebookresearch/fastText) learn subword embeddings to generate OOV word representations, but scale poorly as they can run only on CPUs. BlazingText extends the FastText model to leverage GPUs, thus providing more than 10x speedup, depending on the hardware."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[1] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, [Enriching Word Vectors with Subword Information](https://arxiv.org/pdf/1607.04606.pdf)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "Let's start by specifying:\n",
    "\n",
    "- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting. If you don't specify a bucket, SageMaker SDK will create a default bucket following a pre-defined naming convention in the same region. \n",
    "- The IAM role ARN used to give SageMaker access to your data. It can be fetched using the **get_execution_role** method from sagemaker python SDK."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "isConfigCell": true
   },
   "outputs": [],
   "source": [
    "import sagemaker\n",
    "from sagemaker import get_execution_role\n",
    "import boto3\n",
    "import json\n",
    "\n",
    "sess = sagemaker.Session()\n",
    "\n",
    "role = get_execution_role()\n",
    "print(role) # This is the role that SageMaker would use to leverage AWS resources (S3, CloudWatch) on your behalf\n",
    "\n",
    "bucket = sess.default_bucket() # Replace with your own bucket name if needed\n",
    "print(bucket)\n",
    "prefix = 'blazingtext/subwords' #Replace with the prefix under which you want to store the data if needed"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data Ingestion\n",
    "\n",
    "Next, we download a dataset from the web on which we want to train the word vectors. BlazingText expects a single preprocessed text file with space separated tokens and each line of the file should contain a single sentence.\n",
    "\n",
    "In this example, let us train the vectors on [text8](http://mattmahoney.net/dc/textdata.html) dataset (100 MB), which is a small (already preprocessed) version of Wikipedia dump.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget http://mattmahoney.net/dc/text8.zip -O text8.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Uncompressing\n",
    "!gzip -d text8.gz -f"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After the data downloading and uncompressing is complete, we need to upload it to S3 so that it can be consumed by SageMaker to execute training jobs. We'll use Python SDK to upload these two files to the bucket and prefix location that we have set above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "train_channel = prefix + '/train'\n",
    "\n",
    "sess.upload_data(path='text8', bucket=bucket, key_prefix=train_channel)\n",
    "\n",
    "s3_train_data = 's3://{}/{}'.format(bucket, train_channel)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we need to setup an output location at S3, where the model artifact will be dumped. These artifacts are also the output of the algorithm's training job."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "s3_output_location = 's3://{}/{}/output'.format(bucket, prefix)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training Setup\n",
    "Now that we are done with all the setup that is needed, we are ready to train our object detector. To begin, let us create a ``sageMaker.estimator.Estimator`` object. This estimator will launch the training job."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "region_name = boto3.Session().region_name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "container = sagemaker.amazon.amazon_estimator.get_image_uri(region_name, \"blazingtext\", \"latest\")\n",
    "print('Using SageMaker BlazingText container: {} ({})'.format(container, region_name))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training the BlazingText model for generating word vectors"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similar to the original implementation of [Word2Vec](https://arxiv.org/pdf/1301.3781.pdf), SageMaker BlazingText provides an efficient implementation of the continuous bag-of-words (CBOW) and skip-gram architectures using Negative Sampling, on CPUs and additionally on GPU[s]. The GPU implementation uses highly optimized CUDA kernels. To learn more, please refer to [*BlazingText: Scaling and Accelerating Word2Vec using Multiple GPUs*](https://dl.acm.org/citation.cfm?doid=3146347.3146354).\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Besides skip-gram and CBOW, SageMaker BlazingText also supports the \"Batch Skipgram\" mode, which uses efficient mini-batching and matrix-matrix operations ([BLAS Level 3 routines](https://software.intel.com/en-us/mkl-developer-reference-fortran-blas-level-3-routines)). This mode enables distributed word2vec training across multiple CPU nodes, allowing almost linear scale up of word2vec computation to process hundreds of millions of words per second. Please refer to [*Parallelizing Word2Vec in Shared and Distributed Memory*](https://arxiv.org/pdf/1604.04661.pdf) to learn more."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "BlazingText also supports a *supervised* mode for text classification. It extends the FastText text classifier to leverage GPU acceleration using custom CUDA kernels. The model can be trained on more than a billion words in a couple of minutes using a multi-core CPU or a GPU, while achieving performance on par with the state-of-the-art deep learning text classification algorithms. For more information, please refer to [algorithm documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html) or [the text classification notebook](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/blazingtext_text_classification_dbpedia/blazingtext_text_classification_dbpedia.ipynb)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To summarize, the following modes are supported by BlazingText on different types instances:\n",
    "\n",
    "|          Modes         \t| cbow (supports subwords training) \t| skipgram (supports subwords training) \t| batch_skipgram \t| supervised |\n",
    "|:----------------------:\t|:----:\t|:--------:\t|:--------------:\t| :--------------:\t|\n",
    "|   Single CPU instance  \t|   ✔  \t|     ✔    \t|        ✔       \t|  ✔  |\n",
    "|   Single GPU instance  \t|   ✔  \t|     ✔    \t|                \t|  ✔ (Instance with 1 GPU only)  |\n",
    "| Multiple CPU instances \t|      \t|          \t|        ✔       \t|     | |\n",
    "\n",
    "Now, let's define the resource configuration and hyperparameters to train word vectors on *text8* dataset, using \"skipgram\" mode on a `c4.2xlarge` instance.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "bt_model = sagemaker.estimator.Estimator(container,\n",
    "                                         role, \n",
    "                                         train_instance_count=1, \n",
    "                                         train_instance_type='ml.c4.2xlarge', # Use of ml.p3.2xlarge is highly recommended for highest speed and cost efficiency\n",
    "                                         train_volume_size = 30,\n",
    "                                         train_max_run = 360000,\n",
    "                                         input_mode= 'File',\n",
    "                                         output_path=s3_output_location,\n",
    "                                         sagemaker_session=sess)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Please refer to [algorithm documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext_hyperparameters.html) for the complete list of hyperparameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "bt_model.set_hyperparameters(mode=\"skipgram\",\n",
    "                             epochs=5,\n",
    "                             min_count=5,\n",
    "                             sampling_threshold=0.0001,\n",
    "                             learning_rate=0.05,\n",
    "                             window_size=5,\n",
    "                             vector_dim=100,\n",
    "                             negative_samples=5,\n",
    "                             subwords=True, # Enables learning of subword embeddings for OOV word vector generation\n",
    "                             min_char=3, # min length of char ngrams\n",
    "                             max_char=6, # max length of char ngrams\n",
    "                             batch_size=11, #  = (2*window_size + 1) (Preferred. Used only if mode is batch_skipgram)\n",
    "                             evaluation=True)# Perform similarity evaluation on WS-353 dataset at the end of training"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that the hyper-parameters are setup, let us prepare the handshake between our data channels and the algorithm. To do this, we need to create the `sagemaker.session.s3_input` objects from our data channels. These objects are then put in a simple dictionary, which the algorithm consumes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "train_data = sagemaker.session.s3_input(s3_train_data, distribution='FullyReplicated', \n",
    "                        content_type='text/plain', s3_data_type='S3Prefix')\n",
    "data_channels = {'train': train_data}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have our `Estimator` object, we have set the hyper-parameters for this object and we have our data channels linked with the algorithm. The only  remaining thing to do is to train the algorithm. The following command will train the algorithm. Training the algorithm involves a few steps. Firstly, the instance that we requested while creating the `Estimator` classes is provisioned and is setup with the appropriate libraries. Then, the data from our channels are downloaded into the instance. Once this is done, the training job begins. The provisioning and data downloading will take some time, depending on the size of the data. Therefore it might be a few minutes before we start getting training logs for our training jobs. The data logs will also print out `Spearman's Rho` on some pre-selected validation datasets after the training job has executed. This metric is a proxy for the quality of the algorithm. \n",
    "\n",
    "Once the job has finished a \"Job complete\" message will be printed. The trained model can be found in the S3 bucket that was setup as `output_path` in the estimator."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bt_model.fit(inputs=data_channels, logs=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Hosting / Inference\n",
    "Once the training is done, we can deploy the trained model as an Amazon SageMaker real-time hosted endpoint. This will allow us to make predictions (or inference) from the model. Note that we don't have to host on the same type of instance that we used to train. Because instance endpoints will be up and running for long, it's advisable to choose a cheaper instance for inference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bt_endpoint = bt_model.deploy(initial_instance_count = 1,instance_type = 'ml.m4.xlarge')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Getting vector representations for words [including out-of-vocabulary (OOV) words]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since, we trained with **```subwords = \"True\"```**, we can get vector representations for any word - including misspelled words or words which were not there in the training dataset.  \n",
    "If we train without the subwords flag, the training will be much faster but the model won't be able to generate vectors for OOV words. Instead, it will return a vector of zeros for such words."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Use JSON format for inference\n",
    "The payload should contain a list of words with the key as \"**instances**\". BlazingText supports content-type `application/json`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "words = [\"awesome\", \"awweeesome\"]\n",
    "\n",
    "payload = {\"instances\" : words}\n",
    "\n",
    "response = bt_endpoint.predict(json.dumps(payload))\n",
    "\n",
    "vecs = json.loads(response)\n",
    "print(vecs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As expected, we get an n-dimensional vector (where n is vector_dim as specified in hyperparameters) for each of the words."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can evaluate the quality of these representations on the task of word similarity / relatedness. We do so by computing Spearman’s rank correlation coefficient (Spearman, 1904) between human judgement and the cosine similarity between the vector representations.  For English, we can use the [rare word dataset (RW)](https://nlp.stanford.edu/~lmthang/morphoNLM/), introduced by Luong et al. (2013)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget http://www-nlp.stanford.edu/~lmthang/morphoNLM/rw.zip\n",
    "!unzip \"rw.zip\"\n",
    "!cut -f 1,2 rw/rw.txt | awk '{print tolower($0)}' | tr '\\t' '\\n' > query_words.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The above command downloads the RW dataset and dumps all the words for which we need vectors in query_words.txt. Let's read this file and hit the endpoint to get the vectors in batches of 500 words [to respect the 5MB limit of SageMaker hosting.](https://docs.aws.amazon.com/sagemaker/latest/dg/API_runtime_InvokeEndpoint.html#API_runtime_InvokeEndpoint_RequestSyntax)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "query_words = []\n",
    "with open(\"query_words.txt\") as f:\n",
    "    for line in f.readlines():\n",
    "        query_words.append(line.strip())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "query_words = list(set(query_words))\n",
    "total_words = len(query_words)\n",
    "vectors = {}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import math\n",
    "from scipy import stats\n",
    "\n",
    "batch_size = 500\n",
    "batch_start = 0\n",
    "batch_end = batch_start + batch_size\n",
    "while len(vectors) != total_words:\n",
    "    batch_end = min(batch_end, total_words)\n",
    "    subset_words = query_words[batch_start:batch_end]\n",
    "    payload = {\"instances\" : subset_words}\n",
    "    response = bt_endpoint.predict(json.dumps(payload))\n",
    "    vecs = json.loads(response)\n",
    "    for i in vecs:\n",
    "        arr = np.array(i[\"vector\"], dtype=float)\n",
    "        if np.linalg.norm(arr) == 0:\n",
    "            continue\n",
    "        vectors[i[\"word\"]] = arr\n",
    "    batch_start += batch_size\n",
    "    batch_end += batch_size"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have gotten all the vectors, we can compute the Spearman’s rank correlation coefficient between human judgement and the cosine similarity between the vector representations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mysim = []\n",
    "gold = []\n",
    "dropped = 0\n",
    "nwords = 0\n",
    "\n",
    "def similarity(v1, v2):\n",
    "    n1 = np.linalg.norm(v1)\n",
    "    n2 = np.linalg.norm(v2)\n",
    "    return np.dot(v1, v2) / n1 / n2\n",
    "\n",
    "fin = open(\"rw/rw.txt\", 'rb')\n",
    "for line in fin:\n",
    "    tline = line.decode('utf8').split()\n",
    "    word1 = tline[0].lower()\n",
    "    word2 = tline[1].lower()\n",
    "    nwords += 1\n",
    "\n",
    "    if (word1 in vectors) and (word2 in vectors):\n",
    "        v1 = vectors[word1]\n",
    "        v2 = vectors[word2]\n",
    "        d = similarity(v1, v2)\n",
    "        mysim.append(d)\n",
    "        gold.append(float(tline[2]))\n",
    "    else:\n",
    "        dropped += 1\n",
    "fin.close()\n",
    "\n",
    "corr = stats.spearmanr(mysim, gold)\n",
    "print(\"Correlation: %s, Dropped words: %s%%\" % (corr[0] * 100, math.ceil(dropped / nwords * 100.0)))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "We can expect a Correlation coefficient of ~40, which is pretty good for a small training dataset like text8. For more details, please refer to [Enriching Word Vectors with Subword Information](https://arxiv.org/pdf/1607.04606.pdf)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Stop / Close the Endpoint (Optional)\n",
    "Finally, we should delete the endpoint before we close the notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sess.delete_endpoint(bt_endpoint.endpoint)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  },
  "notice": "Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.  Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
